Multi-output headed ensembles for product classification

ABSTRACT

An item classification method and system using multi-output headed ensembles is disclosed. The method can include receiving one or more text input sequences at one or more first estimator threads corresponding to the one or more text input sequences. The method can also include tokenizing the one or more text input sequences into one or more first tokens within the one or more first estimator threads. In addition, the method can include outputting one or more item classifications based on an output of the one or more first estimator threads. Further, the method may include applying a backpropagation algorithm to update network weights connecting neural layers in the first estimator threads, defining an optimal setting of network parameters using cross-validation with respect to the first estimator threads, and mapping the one or more first tokens to an embedding space within the one or more first estimator threads.

BACKGROUND

Technical Field

The present disclosure described herein relates to product item classification for e-commerce catalogs.

Background

This section is intended to introduce the reader to aspects of art that may be related to various aspects of the present disclosure described herein, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure described herein. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.

Generally, the taxonomy of an e-commerce catalog consists of thousands of genres having assigned items that are uploaded by merchants on a continuous basis. The genre assignments by merchants are often incorrect but are treated as ground truth labels in automatically generated training sets, thus creating a feedback loop that can lead to poor model quality over time. This problem in taxonomy classification becomes highly pronounced due to the unavailability of sizable curated training sets. Under such a scenario, it is common to combine multiple classifiers to combat the poor generalization performance of a single classifier.

In addition, other factors that contribute to the difficulty of product taxonomy classification in large-scale e-commerce catalogs include the following: 1) continuous large-scale manual annotation is infeasible, and data augmentation, semi-supervised learning, and few-shot learning do not provide any guarantees; 2) the efficacy of data augmentation and semi-supervised learning methods is severely limited in the presence of label noise, which in industrial settings can be around 15%; further, identifying the nature of the corruption in labels is non-trivial, and internal assessments reveal that the genre assignment error rate by merchants is around 20% for a large-scale catalog with more than 13K leaf nodes in the product taxonomy; and 3) there is often an unknown covariate shift in the final evaluation dataset, which reflects the Quality Assurance (QA) team's preferred ways of sampling items, including those strategies that provide incentives to merchants.

Accordingly, what is needed is a more efficient, faster, and more accurate method of product taxonomy classification within catalogs, such as large-scale e-commerce catalogs. More particularly, what is needed is a minimalistic neural network architecture that can take advantage of the reduction of estimator variance for ensembles and the advantages of fusing several classifiers.

BRIEF SUMMARY

In one aspect of the disclosure described herein, a product item and taxonomy classification method and system, namely, a Multi-Output Headed Ensemble (MoHE) framework, is disclosed that is efficient, effective, fast, accurate, and further utilizes minimum computing resources. In particular, the product item classification method and system of the disclosure described herein provides a lightweight and minimalistic neural network architecture that can take advantage of the reduction of estimator variance for ensembles and the advantages of fusing several classifiers, among other advantages. In addition, the MoHE framework system and method of the disclosure described herein is adaptable to include structured metadata, which can be difficult in conventional heavyweight language models such as BERT. In addition, the disclosure described herein provides a way of measuring label discrepancy between training and evaluation sets using user interactions with a product catalog.

In addition, an independent ensemble of classifiers often shows higher predictive variance while classifying out-of-sample items in a test set. This is generally because the independent classifiers have no way of exchanging each other's gradient information while optimizing for the same objective. Here, an MoHE-1 framework system and method of the disclosure described herein addresses this problem by both fusing the output layers of the individual classifiers and averaging the individual predictions of each classifier together with that of the fusion or aggregator module. In addition, an MoHE-2 framework system and method of the disclosure described herein further adds a mini fusion module within each individual classifier.

In another aspect of the disclosure described herein, a highly flexible, scalable, and tunable framework is disclosed to add various “expert” classifiers, referred to herein as estimator threads, where individual estimator threads can also be added for various metadata fields. While most neural networks try to perform input representation learning without additional domain specific insights on the data, such as those reflected in the metadata, the MoHE framework system and method of the disclosure described herein re-enables such effort to be included within the neural modeling for better predictive accuracy.

In another aspect of the disclosure described herein, the MoHE framework system and method can be a loosely coupled ensemble framework, where each individual classifier's output is considered as a head. Here, each head computes the posterior class probabilities when the task being modeled is a classification task. In this framework, however, heads are generally defined at the output layer. The MoHE model of the disclosure described herein, as a statistical estimator, has lower variance than just an independent ensemble of classifiers. In particular, such as referring to FIGS. 4A-4B, tokenized text can be first converted into an embedding vector via embedding (EMB) modules, which is then encoded via encoder (ENC) modules using Convolutional Neural Networks (CNNs) with a dropout layer. Still referring to FIGS. 4A-4B, a layer normalizer (LayerNorm) can then be applied to the resulting vector from the encoder modules. The aggregator (AGG) network module accepts the concatenation of all such layer normalized vectors and is itself a feed forward neural network. Each classifier (CLF) module in FIGS. 4A-4B is also a feed forward neural network classifier. The CLF modules together with the AGG module in FIGS. 4A-4B constitute the heads of the MoHE model and framework system and method of the disclosure described herein. Still referring to FIGS. 4A-4B, the AGG module can act as a small fusion network within the MoHE framework. In addition, each stack of embeddings, encoder, layer normalizer and classifier can be referred to herein as an estimator thread. Here, the MoHE framework system and method of the disclosure described herein can correlate decisions from individual classifiers using an aggregator neural network (or aggregator module/function) to reduce prediction variance further than that obtained by using a classifier ensemble alone. Moreover, the MoHE framework is flexible enough to incorporate arbitrarily complex encoders and classifier heads depending on application and business needs. The MoHE framework also fixes the problem of having just one shared input for all estimator threads, which is the case for the MoE model (FIG. 3B), among other advantages.
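By way of a non-limiting illustration only, the data flow described above for an MoHE-1 style model can be sketched in Python with NumPy. The sketch makes several assumptions that are not part of the disclosure: the layer sizes, the random (untrained) weights, the use of ReLU with max-over-time pooling inside the encoder, a LayerNorm without learned gain or bias, and CLF and AGG heads reduced to a single linear layer followed by a softmax. The names `EstimatorThread`, `conv_encode`, and `mohe1_forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain/bias (a simplification)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def conv_encode(emb, kernel):
    """1-D convolution over token positions, then ReLU and max pooling.

    emb: (seq_len, emb_dim); kernel: (width, emb_dim, n_filters).
    ReLU and max-over-time pooling are assumptions of this sketch.
    """
    width = kernel.shape[0]
    windows = np.stack([emb[i:i + width] for i in range(len(emb) - width + 1)])
    feats = np.einsum("nwd,wdf->nf", windows, kernel)
    return np.maximum(feats, 0.0).max(axis=0)


class EstimatorThread:
    """One EMB -> ENC -> LayerNorm -> CLF stack (hypothetical sizes)."""

    def __init__(self, vocab_size, emb_dim, n_filters, n_classes, width=2):
        self.emb = rng.normal(0.0, 0.1, (vocab_size, emb_dim))
        self.kernel = rng.normal(0.0, 0.1, (width, emb_dim, n_filters))
        self.clf_w = rng.normal(0.0, 0.1, (n_filters, n_classes))

    def encode(self, token_ids):
        # Dropout is omitted: it is only active during training.
        return layer_norm(conv_encode(self.emb[token_ids], self.kernel))

    def head(self, hidden):
        # CLF head: posterior class probabilities for this thread.
        return softmax(hidden @ self.clf_w)


def mohe1_forward(threads, agg_w, inputs):
    """Return the averaged posterior and the list of per-head posteriors."""
    hidden = [t.encode(ids) for t, ids in zip(threads, inputs)]
    heads = [t.head(h) for t, h in zip(threads, hidden)]
    # AGG head: consumes the concatenation of all layer-normalized vectors.
    heads.append(softmax(np.concatenate(hidden) @ agg_w))
    return np.mean(heads, axis=0), heads


# Two threads with different vocabularies (separate inputs), four classes.
threads = [EstimatorThread(50, 8, 6, 4), EstimatorThread(30, 8, 6, 4)]
agg_w = rng.normal(0.0, 0.1, (12, 4))
inputs = [np.array([3, 7, 1, 9]), np.array([2, 2, 5])]
probs, heads = mohe1_forward(threads, agg_w, inputs)
print(probs.shape, len(heads))
```

In this sketch each estimator thread receives its own token sequence, and the model exposes one head per thread plus one aggregator head, whose posteriors are averaged.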

In another aspect of the disclosure described herein, an item classification method using multi-output headed ensembles is disclosed. The method can include receiving one or more text input sequences at one or more first estimator threads corresponding to each of the one or more text input sequences; tokenizing each of the one or more text input sequences into one or more first tokens within each of the one or more first estimator threads; and outputting one or more item classifications based on an output of the one or more first estimator threads. The method can also include applying a backpropagation algorithm to update one or more network weights connecting one or more neural layers in each of the one or more first estimator threads; defining an optimal setting of network parameters using cross-validation with respect to each of the one or more first estimator threads; and mapping each of the one or more first tokens to an embedding space within each of the one or more first estimator threads. In addition, the method can include defining one or more hyperparameters using an efficient hyperparameter search technique with respect to each of the one or more first estimator threads. The method can also include tokenizing each of the one or more text input sequences into one or more second tokens within one or more second estimator threads corresponding to each of the second tokens. Further, the method can include determining one or more coordinates for each of the one or more second tokens within an embedding space of each of the one or more second estimator threads. The method can also include encoding the determined one or more coordinates for each of the one or more second tokens using one or more convolutional neural network (CNN) weights with a dropout layer, thereby resulting in one or more vectors with respect to each of the one or more second estimator threads.

In addition, the method can include applying a layer normalizer to the one or more vectors to normalize each of the one or more vectors within each of the one or more second estimator threads; and sending the normalized one or more vectors from each of the one or more second estimator threads to an aggregator. Further, the method can include calculating one or more posterior class probabilities for one or more output heads corresponding to each of the one or more second estimator threads. The method can also include obtaining one or more item classifications based on the one or more posterior class probabilities at each output head for each of the one or more second estimator threads. Here, the averaged or summed one or more posterior class probabilities at each output head can further include an output of the aggregator.

In another aspect of the disclosure described herein, an apparatus for classifying items using multi-output headed ensembles is disclosed. The apparatus can include a memory storage storing computer program code; and a processor communicatively coupled to the memory storage, wherein the processor is configured to execute the computer program code and cause the apparatus to receive one or more text input sequences at one or more first estimator threads corresponding to each of the one or more text input sequences; tokenize each of the one or more text input sequences into one or more first tokens within each of the one or more first estimator threads; and output one or more item classifications based on an output of the one or more first estimator threads. In addition, the computer program code, when executed by the processor, further causes the apparatus to apply a backpropagation algorithm to update one or more network weights connecting one or more neural layers in each of the one or more first estimator threads; define an optimal setting of network parameters using cross-validation with respect to each of the one or more first estimator threads; and map each of the one or more first tokens to an embedding space within each of the one or more first estimator threads. Further, the computer program code, when executed by the processor, further causes the apparatus to define one or more hyperparameters using an efficient hyperparameter search technique with respect to each of the one or more first estimator threads. Also, the computer program code, when executed by the processor, further causes the apparatus to tokenize each of the one or more text input sequences into one or more second tokens within one or more second estimator threads corresponding to each of the second tokens. In addition, the computer program code, when executed by the processor, further causes the apparatus to determine one or more coordinates for each of the one or more second tokens within an embedding space of each of the one or more second estimator threads.

In the apparatus, the computer program code, when executed by the processor, can further cause the apparatus to encode the determined one or more coordinates for each of the one or more second tokens using one or more convolutional neural network (CNN) weights with a dropout layer, thereby resulting in one or more vectors with respect to each of the one or more second estimator threads. In addition, the computer program code, when executed by the processor, further causes the apparatus to apply a layer normalizer to the one or more vectors to normalize each of the one or more vectors within each of the one or more second estimator threads; and send the normalized one or more vectors from each of the one or more second estimator threads to an aggregator. Further, the computer program code, when executed by the processor, further causes the apparatus to calculate one or more posterior class probabilities for one or more output heads corresponding to each of the one or more second estimator threads. Also, the computer program code, when executed by the processor, further causes the apparatus to obtain the one or more item classifications based on the one or more posterior class probabilities at each output head for each of the one or more second estimator threads.

In another aspect of the disclosure described herein, a non-transitory computer-readable medium comprising computer program code for classifying items using multi-output headed ensembles by an apparatus is disclosed, wherein the computer program code, when executed by at least one processor of the apparatus, causes the apparatus to receive one or more text input sequences at one or more first estimator threads corresponding to each of the one or more text input sequences; tokenize each of the one or more text input sequences into one or more first tokens within each of the one or more first estimator threads; and output one or more item classifications based on an output of the one or more first estimator threads.

The above summary is not intended to describe each and every disclosed embodiment or every implementation of the disclosure. The Description that follows more particularly exemplifies the various illustrative embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

The following description should be read with reference to the drawings, in which like elements in different drawings are numbered in like fashion. The drawings, which are not necessarily to scale, depict selected embodiments and are not intended to limit the scope of the disclosure. The disclosure may be more completely understood in consideration of the following detailed description of various embodiments in connection with the accompanying drawings, in which:

FIG. 1A illustrates a diagram for one non-limiting exemplary embodiment of a general simplified network architecture of the disclosure described herein.

FIG. 1B illustrates a block diagram for one non-limiting exemplary embodiment of a process flow of the disclosure described herein.

FIG. 2 illustrates a block diagram for one non-limiting exemplary embodiment of an aggregator model.

FIG. 3A illustrates a block diagram for one non-limiting exemplary embodiment of an ensemble model.

FIG. 3B illustrates a block diagram for one non-limiting exemplary embodiment of a mixture of experts (MoE) model.

FIG. 4A illustrates a block diagram for one non-limiting exemplary embodiment of the multi-output head ensemble (MoHE-1) of the disclosure described herein.

FIG. 4B illustrates a block diagram for another non-limiting exemplary embodiment of the multi-output head ensemble (MoHE-2) of the disclosure described herein.

FIG. 5A illustrates a block diagram for another non-limiting exemplary embodiment of the multi-output head ensemble (MoHE-1, method-1) of the disclosure described herein having metadata estimator threads.

FIG. 5B illustrates a block diagram for another non-limiting exemplary embodiment of the multi-output head ensemble (MoHE-2, method-1) of the disclosure described herein having metadata estimator threads.

FIG. 5C illustrates a block diagram for another non-limiting exemplary embodiment of the multi-output head ensemble (MoHE-2, method-2) of the disclosure described herein having metadata estimator threads.

FIGS. 6A-7 illustrate various tables with respect to experimental testing data of the disclosure described herein.

FIGS. 8-10 illustrate various charts with respect to the experimental testing data of the disclosure described herein.

FIGS. 11A-11C illustrate various tables with respect to the experimental testing data of the disclosure described herein.

DETAILED DESCRIPTION

The following detailed description of example embodiments refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements.

The foregoing disclosure provides illustration and description, but is not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above disclosure or may be acquired from practice of the implementations. Further, one or more features or components of one embodiment may be incorporated into or combined with another embodiment (or one or more features of another embodiment). Additionally, in the flowcharts and descriptions of operations provided below, it is understood that one or more operations may be omitted, one or more operations may be added, one or more operations may be performed simultaneously (at least in part), and the order of one or more operations may be switched.

It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code, it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.

Even though particular combinations of features are recited in the claims and/or disclosed in the specification, these combinations are not intended to limit the disclosure of possible implementations. In fact, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible implementations includes each dependent claim in combination with every other claim in the claim set.

No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles “a” and “an” are intended to include one or more items, and may be used interchangeably with “one or more.” Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has,” “have,” “having,” “include,” “including,” or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based, at least in part, on” unless explicitly stated otherwise. Furthermore, expressions such as “at least one of [A] and [B]” or “at least one of [A] or [B]” are to be understood as including only A, only B, or both A and B.

Reference throughout this specification to “one embodiment,” “an embodiment,” “non-limiting exemplary embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the indicated embodiment is included in at least one embodiment of the present solution. Thus, the phrases “in one embodiment,” “in an embodiment,” “in one non-limiting exemplary embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.

Furthermore, the described features, advantages, and characteristics of the present disclosure may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, in light of the description herein, that the present disclosure can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the present disclosure.

In one implementation of the disclosure described herein, a display page may include information residing in the computing device's memory, which may be transmitted from the computing device over a network to a central database center and vice versa. The information may be stored in memory at each of the computing device, a data storage residing at the edge of the network, or on the servers at the central database centers. A computing device or mobile device may receive non-transitory computer readable media, which may contain instructions, logic, data, or code that may be stored in persistent or temporary memory of the mobile device, or may somehow affect or initiate action by a mobile device. Similarly, one or more servers may communicate with one or more mobile devices across a network, and may transmit computer files residing in memory. The network, for example, can include the Internet, a wireless communication network, or any other network for connecting one or more mobile devices to one or more servers.

Any discussion of a computing or mobile device may also apply to any type of networked device, including but not limited to mobile devices and phones such as cellular phones (e.g., an iPhone®, Android®, Blackberry®, or any “smart phone”), a personal computer, iPad®, server computer, or laptop computer; personal digital assistants (PDAs) such as an Android®-based device or Windows® device; a roaming device, such as a network-connected roaming device; a wireless device such as a wireless email device or other device capable of communicating wirelessly with a computer network; or any other type of network device that may communicate over a network and handle electronic transactions. Any discussion of any mobile device mentioned may also apply to other devices, such as devices including Bluetooth®, near-field communication (NFC), infrared (IR), and Wi-Fi functionality, among others.

Phrases and terms similar to “software”, “application”, “app”, and “firmware” may include any non-transitory computer readable medium storing thereon a program, which when executed by a computer, causes the computer to perform a method, function, or control operation.

Phrases and terms similar to “network” may include one or more data links that enable the transport of electronic data between computer systems and/or modules. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer uses that connection as a computer-readable medium. Thus, by way of example, and not limitation, computer-readable media can also include a network or data links which can be used to carry or store desired program code means in the form of computer program code or data structures and which can be accessed by a general purpose or special purpose computer.

Phrases and terms similar to “portal” or “terminal” may include an intranet page, internet page, locally residing software or application, mobile device graphical user interface, or digital presentation for a user. The portal may also be any graphical user interface for accessing various modules, components, features, options, and/or attributes of the disclosure described herein. For example, the portal can be a web page accessed with a web browser, mobile device application, or any application or software residing on a computing device.

FIG. 1A illustrates one non-limiting exemplary embodiment of a general network architecture of the network services marketplace platform, process, computing device, apparatus, computer-readable medium, and system of the disclosure described herein. In particular, users 110, including user terminals A, B, and C, can be in bi-directional communication over a secure network with central servers or application servers 100 of the MoHE framework system and method of the disclosure described herein. Here, servers 100 can include one or more e-commerce websites or portals. In addition, users 110 may also be in direct bi-directional communication with each other via the MoHE framework system and method of the disclosure described herein. Here, users 110 may be any type of end user. Each of users 110 can communicate with servers 100 via their respective terminals or portals.

Still referring to FIG. 1A, central servers 100 of the MoHE framework system and method of the disclosure described herein can be in further bi-directional communication with admin terminal/dashboard 120. Here, admin terminal/dashboard 120 can provide various tools to a user to manage any back-end or back-office systems, servers, applications, processes, privileges, and various end users of the disclosure described herein, or communicate with any of users 110 and servers 100, 130, and 140. Central servers 100 may also be in bi-directional communication with product catalog servers 130, which can include various types of product items, product catalogs, and product taxonomy data. Further, central servers 100 of the disclosure described herein can be in further bi-directional communication with database/third party servers 140. Here, servers 140 can provide various types of data storage (such as cloud-based storage), web services, content creation tools, data streams, data feeds, and/or provide various types of third-party support services to central servers 100 of the MoHE framework system and method. However, it is contemplated within the scope of the present disclosure described herein that the MoHE framework system and method of the disclosure described herein can include any type of general network architecture.

Still referring to FIG. 1A, one or more of servers or terminals of elements 100-140 may include a personal computer (PC), a printed circuit board comprising a computing device, a minicomputer, a mainframe computer, a microcomputer, a telephonic computing device, a wired/wireless computing device (e.g., a smartphone, a personal digital assistant (PDA)), a laptop, a tablet, a smart device, a wearable device, or any other similar functioning device.

In some embodiments, as shown in FIG. 1A, one or more servers, terminals, and users 100-140 may include a set of components, such as a processor, a memory, a storage component, an input component, an output component, a communication interface, and a JSON UI rendering component. The set of components of the device may be communicatively coupled via a bus.

The bus may comprise one or more components that permit communication among the set of components of one or more of servers or terminals of elements 100-140. For example, the bus may be a communication bus, a cross-over bar, a network, or the like. The bus may be implemented using single or multiple (two or more) connections between the set of components of one or more of servers or terminals of elements 100-140. The disclosure is not limited in this regard.

One or more of servers or terminals of elements 100-140 may comprise one or more processors. The one or more processors may be implemented in hardware, firmware, and/or a combination of hardware and software. For example, the one or more processors may comprise a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), a microprocessor, a microcontroller, a digital signal processor (DSP), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a general purpose single-chip or multi-chip processor, or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, or any conventional processor, controller, microcontroller, or state machine. The one or more processors also may be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. In some embodiments, particular processes and methods may be performed by circuitry that is specific to a given function.

The one or more processors may control overall operation of one or more of servers or terminals of elements 100-140 and/or of the set of components of one or more of servers or terminals of elements 100-140 (e.g., memory, storage component, input component, output component, communication interface, rendering component).

One or more of servers or terminals of elements 100-140 may further comprise memory. In some embodiments, the memory may comprise a random access memory (RAM), a read only memory (ROM), an electrically erasable programmable ROM (EEPROM), a flash memory, a magnetic memory, an optical memory, and/or another type of dynamic or static storage device. The memory may store information and/or instructions for use (e.g., execution) by the processor.

A storage component of one or more of servers or terminals of elements 100-140 may store information and/or computer-readable instructions and/or code related to the operation and use of one or more of servers or terminals of elements 100-140. For example, the storage component may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optic disk, and/or a solid state disk), a compact disc (CD), a digital versatile disc (DVD), a universal serial bus (USB) flash drive, a Personal Computer Memory Card International Association (PCMCIA) card, a floppy disk, a cartridge, a magnetic tape, and/or another type of non-transitory computer-readable medium, along with a corresponding drive.

One or more of servers or terminals of elements 100-140 may further comprise an input component. The input component may include one or more components that permit one or more of servers and terminals 110-140 to receive information, such as via user input (e.g., a touch screen, a keyboard, a keypad, a mouse, a stylus, a button, a switch, a microphone, a camera, and the like). Alternatively or additionally, the input component may include a sensor for sensing information (e.g., a global positioning system (GPS) component, an accelerometer, a gyroscope, an actuator, and the like).

An output component of any one or more of servers or terminals of elements 100-140 may include one or more components that may provide output information from the device 100 (e.g., a display, a liquid crystal display (LCD), light-emitting diodes (LEDs), organic light emitting diodes (OLEDs), a haptic feedback device, a speaker, and the like).

One or more of servers or terminals of elements 100-140 may further comprise a communication interface. The communication interface may include a receiver component, a transmitter component, and/or a transceiver component. The communication interface may enable one or more of servers or terminals of elements 100-140 to establish connections and/or transfer communications with other devices (e.g., a server, another device). The communications may be effected via a wired connection, a wireless connection, or a combination of wired and wireless connections. The communication interface may permit one or more of servers or terminals of elements 100-140 to receive information from another device and/or provide information to another device. In some embodiments, the communication interface may provide for communications with another device via a network, such as a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cellular network (e.g., a fifth generation (5G) network, a long-term evolution (LTE) network, a third generation (3G) network, a code division multiple access (CDMA) network, and the like), a public land mobile network (PLMN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), or the like, and/or a combination of these or other types of networks. Alternatively or additionally, the communication interface may provide for communications with another device via a device-to-device (D2D) communication link, such as Flash-LinQ, WiMedia, Bluetooth®, ZigBee®, Wi-Fi, LTE, 5G, and the like. In other embodiments, the communication interface may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, or the like.

FIG. 1B illustrates one non-limiting exemplary embodiment of a process for the MoHE framework system and method of the disclosure described herein, which can include a training phase that is followed by a classification phase. With respect to the training phase, the process can begin at step 200, where one or more raw input sequences are first tokenized into individual tokens. Here, each estimator thread of the MoHE-1 or MoHE-2 models (FIGS. 4A and 4B) can accept different kinds of tokenized inputs, and each token in a particular input is mapped to a high-dimensional vector space called the embedding space for the input. The process can then move to step 202, where training is performed by a standard backpropagation algorithm to update the network weights connecting the neural layers in each estimator thread as well as those that connect to the aggregator neural network. Here, the classification loss function applied at each output is the cross-entropy loss. The process can then move to step 204, where the best setting of the network parameters is selected using cross-validation and the hyperparameters are set using an efficient hyperparameter search technique. The process can then move to the classification phase, or step 206.

Still referring to FIG. 1B, at the classification phase, and at step 206, the process can first tokenize the raw input text sequence into individual tokens using the same mechanism that was used during training (i.e., step 200). Here, the tokenized sequences are then fed into the MoHE-1 or MoHE-2 models as inputs, as shown in FIGS. 4A and 4B. The process can then proceed to step 208. At step 208, for each token in the sequence input to each estimator thread, its coordinate in the embedding space (EMB), or vector space, is looked up and combined with the coordinates of the other tokens using learnt Convolutional Neural Network (CNN) weights. For each estimator thread and the aggregator network, the input embeddings are transformed by the weight parameters of the different neural layers in the network, and the final posterior class probabilities for each head are computed. The process can then proceed to step 210, where the final classification is obtained by averaging or summing the posterior class probabilities from the classifier and aggregator network heads.
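By way of a non-limiting illustration only, step 210 can be sketched in Python as follows. The function names and the example probability values are hypothetical and are not taken from the disclosure; the sketch assumes that every head has already produced a posterior probability vector over the same set of classes.

```python
def average_heads(head_probs):
    """Average posterior class probabilities over all heads (estimator threads plus aggregator)."""
    n_heads = len(head_probs)
    n_classes = len(head_probs[0])
    return [sum(p[k] for p in head_probs) / n_heads for k in range(n_classes)]


def classify(head_probs):
    """Return the index of the class with the largest averaged posterior and the averaged vector."""
    avg = average_heads(head_probs)
    best = max(range(len(avg)), key=lambda k: avg[k])
    return best, avg


# Hypothetical outputs of two estimator-thread heads and one aggregator head.
heads = [
    [0.6, 0.3, 0.1],  # thread 1
    [0.2, 0.5, 0.3],  # thread 2
    [0.5, 0.4, 0.1],  # aggregator
]
label, averaged = classify(heads)
```

Summing instead of averaging only rescales the vector by the number of heads, so the selected class is the same in either case.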

Here, the MoHE-1 model of the disclosure described herein with CNN encoders was observed to outperform the Mixture of Experts (MoE) model significantly on a classification task, which is to identify specific leaf level genres of items in a product catalog. The significant improvement was achieved on all segments of the catalog, namely, the head, torso, and tail, which constitute the top 70%, next 20%, and final 10% of items by volume, respectively. In addition, the MoHE-1 model was observed to significantly outperform the Ensemble framework (FIG. 3A) in the head segment of the catalog, where the data is more concentrated by volume.

In addition, in another non-limiting exemplary embodiment, a variation of the MoHE-1 model, referred to herein as MoHE-2 (FIG. 4B), was also shown to provide significant improvement over conventional models. In particular, the MoHE-2 model has additional neural layers to enhance the representational power of the input even more than MoHE-1. Here, MoHE-2 incorporates additional non-linearities within each estimator thread that act as a mini-aggregator network, which allows for the interaction of information geometries in two spaces, namely, a function of the input's embedding space (the mean in the minimal framework) and the input's encoding space.

Further, in other embodiments, metadata can be incorporated into both the MoHE-1 and MoHE-2 model frameworks of the disclosure described herein using a first method, referred to herein as method-1. In particular, additional estimator threads can be added to the MoHE model framework for each kind of metadata input. The output from each metadata encoder thread connects to all the classifier and aggregator neural layers. Here, the whole network can be trained on the training examples using a standard backpropagation algorithm. For example, FIG. 5A illustrates method-1 as applied to the MoHE-1 model and FIG. 5B illustrates method-1 as applied to the MoHE-2 model.

In another embodiment, referred to herein as method-2 and shown in FIG. 5C as applied to MoHE-2, the output from each metadata estimator thread of the same type as in method-1 can feed into all the Single Layer Perceptron (SLP) neural layers of each of the estimator threads for the primary inputs in the MoHE-2 model. This can be performed so that the metadata encodings add to the mini-aggregator network in each of the primary estimator threads. Also, with the addition of Merchant ID, Attribute Tag ID, and item descriptions, the MoHE-2 model with the second method (method-2) of incorporating metadata significantly outperforms all models compared here on both the head and torso segments of a tested e-commerce product catalog.

Generally, ensembles of independent estimators can generalize better than an individual estimator in that the variance of the ensemble estimator is smaller than that of the worst individual estimator. In particular, consider T independent estimators that estimate the posterior class probabilities by $g_t^{\mathcal{D}}(x)$, where $\mathcal{D}$ is the training dataset and x is any sample. Let $g_i(x)$, for some i∈{1, . . . , T}, be the estimator with the worst variance, dropping the superscript $\mathcal{D}$ where dependence on $\mathcal{D}$ is assumed, and let this variance be σ². Then the following can be represented (Equation 1):

$\mathrm{Var}\left(\frac{1}{T}\sum_{t=1}^{T} g_t(x)\right) = \frac{1}{T^2}\sum_{t=1}^{T}\mathrm{Var}\left(g_t(x)\right) = \frac{1}{T^2}\sum_{t=1}^{T}\sigma_t^2 \leq \frac{1}{T}\sigma^2$
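By way of a non-limiting illustration only, the inequality of Equation 1 can be checked numerically with the following Python sketch. The per-estimator standard deviations, the common mean of 0.5, and the use of Gaussian noise are assumptions made purely for the illustration and do not come from the experimental data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

sigmas = [0.10, 0.20, 0.30, 0.15]        # hypothetical per-estimator standard deviations
T = len(sigmas)
worst_var = max(s * s for s in sigmas)   # sigma^2 in Equation 1

# Closed form: variance of the mean of T independent estimators.
ensemble_var = sum(s * s for s in sigmas) / T ** 2
bound = worst_var / T                    # right-hand side of Equation 1

# Monte Carlo check with independent Gaussian estimators around a common mean.
samples = []
for _ in range(20000):
    outputs = [rng.gauss(0.5, s) for s in sigmas]
    samples.append(sum(outputs) / T)
empirical_var = statistics.pvariance(samples)
```

In this sketch the closed-form ensemble variance is less than half of the bound, and the simulated variance agrees with the closed form up to sampling error.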

The mixture of experts (“MoE”) model in the context of a neural network is a system of “expert” and gating networks with a selector unit that acts as a multiplexer for stochastically selecting the prediction from the best expert for a given task and input, such as shown in the MoE model of FIG. 3B. However, from a generalization point of view, the MoE classifier has a much looser bound than an ensemble of i.i.d. estimators. For example, if E_(in)(g) denotes the in-sample (training) error and E_(out)(g) denotes the out-of-sample (test) error, then using the union bound of probability, the following can be represented for MoE (Equation 2):

$|E_{in}(g) - E_{out}(g)| > \epsilon \Rightarrow |E_{in}(g_1) - E_{out}(g_1)| > \epsilon \ \text{or} \ \ldots \ \text{or} \ |E_{in}(g_T) - E_{out}(g_T)| > \epsilon$

Applying the Hoeffding inequality to each term, the following can be represented (Equation 3):

$P\left(|E_{in}(g) - E_{out}(g)| > \epsilon\right) \leq \sum_{t=1}^{T} P\left(|E_{in}(g_t) - E_{out}(g_t)| > \epsilon\right) \leq 2Te^{-2\epsilon^2 N}$

where N is the number of in-sample data points. Here, Equation 3 shows that the generalization error bound for MoE can be loose by a factor of T.
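By way of a non-limiting illustration only, the bound of Equation 3 can be evaluated with the following Python sketch. The function names and the example values of ε, N, and T are hypothetical; the sketch only evaluates the expression 2T·exp(−2ε²N) and the sample size at which that expression falls below a chosen tolerance.

```python
import math


def hoeffding_bound(epsilon, n_samples):
    """Upper bound on P(|E_in - E_out| > epsilon) for a single fixed estimator."""
    return 2.0 * math.exp(-2.0 * epsilon ** 2 * n_samples)


def moe_union_bound(epsilon, n_samples, n_experts):
    """Equation 3: union bound over T experts, capped at 1 because it bounds a probability."""
    return min(1.0, n_experts * hoeffding_bound(epsilon, n_samples))


def samples_needed(epsilon, delta, n_experts=1):
    """Smallest N for which 2 * T * exp(-2 * epsilon^2 * N) <= delta."""
    return math.ceil(math.log(2.0 * n_experts / delta) / (2.0 * epsilon ** 2))


single = hoeffding_bound(0.05, 2000)
moe = moe_union_bound(0.05, 2000, n_experts=7)
n_single = samples_needed(0.05, 0.05, n_experts=1)
n_moe = samples_needed(0.05, 0.05, n_experts=7)
```

For the same ε and N, the MoE bound is T times the single-estimator bound, so more in-sample data points are needed to reach the same tolerance.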

FIG. 2 illustrates one non-limiting exemplary embodiment of an aggregator framework with AGG as a “fusion” layer of the disclosure described herein. As shown, AGG does not share the inputs but shares the outputs from the encoders (ENC₁, ENC₂ . . . ENC_(T)) of the estimator threads. Here, the MoHE architecture of the disclosure described herein can be a coupled ensemble framework where each individual classifier's output can be considered as a head. Here, the heads can be defined only at the output layer, or the Multi-Output Heads (Output₁, Output₂ . . . Output_(T), Output_(T+1)) as shown in FIG. 4A. In addition, as shown in FIG. 4A, any number of independent input-encoder-output units, which can be referred to as “estimator threads” or “threads,” are loosely coupled through an additional classification module which can be referred to as the aggregator. Here, the aggregator can perform the functions of the fusion module shown in FIG. 2. Here, each thread is allowed to have its own unique (and transformed) input, parameters, encoder, and output layer for single task problems. The MoHE framework system and method of the disclosure described herein can be extended to handle multi-task problems.

Referring to FIGS. 4A-4B, the number of heads can be T+1, where T is the number of threads (estimators) chosen by design and the additional one is for the aggregator that loosely couples the estimator threads. Here, posterior class probability estimates can then be obtained by either taking the output from the aggregator (such as shown in FIG. 2) or summing all (or part) of the output probabilities from the output heads of the estimator threads including the aggregator. Here, it has been observed that the latter typically outperforms the former except at early stages of the training or for small training datasets. In addition, the variance analysis for the MoHE framework system and method of the disclosure described herein can be supported by a distributional argument. In particular, for each category k, the output vector from the heads and the aggregator, g_(k)≡g, follows a multivariate normal distribution. For a particular head t, the covariance and mean for g can be represented by the equation shown in FIG. 17A (Equation 4), as shown below:

$g = \begin{bmatrix} g_t \\ g_{\neg t} \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_{g_t} \\ \mu_{g_{\neg t}} \end{bmatrix}, \begin{bmatrix} \Sigma_{g_t,g_t} & \Sigma_{g_t,g_{\neg t}} \\ \Sigma_{g_{\neg t},g_t} & \Sigma_{g_{\neg t},g_{\neg t}} \end{bmatrix} \right)$

where $g_{\neg t}$ is a T-dimensional vector and $g_t$ is a scalar for each class k. Under this, if all of $g_{\neg t}$ is fixed, then the following representation can be shown (Equation 5):

$\mu_{g_t \mid g_{\neg t}} = \mu_{g_t} + \Sigma_{g_t,g_{\neg t}} \Sigma_{g_{\neg t},g_{\neg t}}^{-1} \left(g_{\neg t} - \mu_{g_{\neg t}}\right), \qquad \Sigma_{g_t \mid g_{\neg t}} = \Sigma_{g_t,g_t} - \Sigma_{g_t,g_{\neg t}} \Sigma_{g_{\neg t},g_{\neg t}}^{-1} \Sigma_{g_{\neg t},g_t}$

In particular, $\Sigma_{g_{\neg t},g_{\neg t}}^{-1}$ is positive definite (PD) since $\Sigma_{g_{\neg t},g_{\neg t}}$ is. This can be shown for an arbitrary PD matrix A with eigenvalues Λ and eigenvectors V:

$Av = \lambda_v v \Rightarrow \frac{1}{\lambda_v} v = A^{-1} v$

for λ_v∈Λ and v∈V. Since $\Sigma_{g_{\neg t},g_{\neg t}}^{-1}$ is PD and since $\Sigma_{g_t,g_{\neg t}} = \Sigma_{g_{\neg t},g_t}^{\top}$, it follows from the definition of positive definiteness, $v^{\top} A v > 0$, that there is a reduction of variance for each g_t, t∈{1, . . . , T+1}, and then the foregoing Equation 1 applies. Here, it is noted that in Equation 5, $\Sigma_{g_t,g_t} \equiv \sigma_{g_t}^2$ for fixed

.
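By way of a non-limiting illustration only, the variance reduction implied by the conditional covariance of a multivariate normal distribution can be checked numerically with the following Python sketch. The 3×3 covariance matrix (two estimator heads plus one aggregator head) is hypothetical, and the sketch uses the standard Schur complement formula for a scalar head conditioned on the remaining heads.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def conditional_variance(cov, t):
    """Variance of head t given the other two heads (Schur complement of a 3x3 covariance)."""
    others = [i for i in range(3) if i != t]
    cross = [cov[t][j] for j in others]                       # covariance of g_t with the other heads
    block = [[cov[i][j] for j in others] for i in others]     # covariance among the other heads
    inv = inverse_2x2(block)
    quad = sum(cross[i] * inv[i][j] * cross[j] for i in range(2) for j in range(2))
    return cov[t][t] - quad


# Hypothetical positive definite covariance of the three head outputs for one class.
cov = [
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.2],
    [0.3, 0.2, 1.0],
]
# Pairs of (marginal variance, conditional variance) for each head.
reductions = [(cov[t][t], conditional_variance(cov, t)) for t in range(3)]
```

Because the subtracted quadratic form is non-negative whenever the inverse block is positive definite, each conditional variance in `reductions` is no larger than the corresponding marginal variance.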

Here, the MoHE framework system and method of the disclosure described herein can include encoder threads with arbitrary parameters and input tokenization. Here, the outputs from all encoders, also referred to herein as CNNs, can be globally max-pooled, concatenated, and given to the aggregator module, such as shown in FIG. 4A, for the baseline MoHE framework system and method, which may also be referred to herein as MoHE-1. Here, T can be defined to be the number of estimator threads, which can be, for instance, independent classifiers in the baseline ensemble framework, such as shown in FIG. 3A. Here, tokenized input text sequences $x_{t_i}$, which can be pre-processed differently for each thread so that $x_{t_i} \neq x_{t_j}$, are converted to word embedding vector representations $V_t \in \mathbb{R}^{L_t \times D_t}$, where L_t and D_t are the input text sequence length and embedding dimension, respectively. Accordingly, the following can be defined (Equation 6):

$V_t = f_{t,1}(x_t) = \mathrm{Dropout}(\mathrm{Embedding}(x_t))$

where the second index in f_t refers to the depth in the architecture of the estimator thread. Accordingly, the subsequence encoding can be represented by the following (Equation 7):

$u_t = f_{t,2}(V_t) = \mathrm{Dropout}(\mathrm{GlobalMaxPool}(\mathrm{CNN}_t(V_t)))$

where $u_t \in \mathbb{R}^{P_t}$ and P_t is the number of filters for CNN_t. Accordingly, the output of estimator thread t can be represented by the following (Equation 8):

$g_t = f_{t,3}(u_t) = \mathrm{Softmax}(\mathrm{CLF}_t(u_t))$

where CLF_t is a densely connected feed-forward neural network (FFNN). Similarly, the output of the aggregator module can be represented by the following (Equation 9):

$g_{T+1} = f_{T+1,3}(\{u_t\}_{t \in [1, \ldots, T]}) = \mathrm{Softmax}(\mathrm{CLF}_{T+1}(\mathrm{Concatenate}(\{u_t\}_{t \in [1, \ldots, T]})))$
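By way of a non-limiting illustration only, the forward pass of Equations 6 through 9 can be sketched in plain Python as follows. The class and function names, the layer sizes, the random weights, and the use of a single bias-free linear layer for each CLF are assumptions made for the illustration; dropout and layer normalization are omitted, as would be the case at classification time for dropout.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    """Small random weight matrix standing in for learnt parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense(weights, x):
    """Linear layer without bias; weights has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class EstimatorThread:
    def __init__(self, vocab_size, emb_dim, kernel, n_filters, n_classes):
        self.kernel = kernel
        self.embedding = rand_matrix(vocab_size, emb_dim)        # EMB_t
        self.filters = rand_matrix(n_filters, kernel * emb_dim)  # CNN_t
        self.clf = rand_matrix(n_classes, n_filters)             # CLF_t (single layer here)

    def encode(self, token_ids):
        v_t = [self.embedding[i] for i in token_ids]             # Equation 6, dropout omitted
        # Flatten each window of `kernel` consecutive token embeddings into one vector.
        windows = [sum(v_t[p:p + self.kernel], [])
                   for p in range(len(v_t) - self.kernel + 1)]
        conv = [dense(self.filters, w) for w in windows]         # one row of filter outputs per window
        return [max(column) for column in zip(*conv)]            # Equation 7: u_t

    def head(self, u_t):
        return softmax(dense(self.clf, u_t))                     # Equation 8: g_t


def mohe1_forward(threads, agg_clf, inputs):
    encodings = [t.encode(x) for t, x in zip(threads, inputs)]
    heads = [t.head(u) for t, u in zip(threads, encodings)]
    concatenated = [value for u in encodings for value in u]
    heads.append(softmax(dense(agg_clf, concatenated)))          # Equation 9: g_{T+1}
    return heads


N_CLASSES = 3
threads = [
    EstimatorThread(vocab_size=10, emb_dim=4, kernel=2, n_filters=3, n_classes=N_CLASSES),
    EstimatorThread(vocab_size=8, emb_dim=3, kernel=3, n_filters=5, n_classes=N_CLASSES),
]
agg_clf = rand_matrix(N_CLASSES, 3 + 5)                          # CLF_{T+1} over concatenated encodings
inputs = [[1, 4, 2, 7], [3, 3, 5, 0, 6, 2]]                      # two differently tokenized views
heads = mohe1_forward(threads, agg_clf, inputs)
```

With T = 2 threads the sketch returns T + 1 = 3 heads, each a probability vector over the same classes, and the aggregator sees only the encoder outputs rather than the raw inputs.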

In addition to the foregoing Equation 9, a layer normalization can be applied to u_t to speed up the convergence and improve performance. Further, Dropouts can appear as in Equations 6 and 7. In addition, the contribution to the training loss function from a single data point can be represented by the following (Equation 10):

$\mathcal{L} = \gamma_{T+1}\,\mathrm{CE}(y, g_{T+1}) + \sum_{t=1}^{T} \gamma_t\,\mathrm{CE}(y, g_t)$

where y is the one-hot representation of a label and the weights γ_t are tuning parameters satisfying $\gamma_{T+1} + \sum_{t=1}^{T} \gamma_t = 1$. Here, the class posterior probabilities to be used for classification could be either g_(T+1) or

$\frac{1}{T+1}\left(g_{T+1} + \sum_{t=1}^{T} g_t\right).$

Here, the MoHE framework system and method can use the latter and further set γ_(T+1)=γ_(t) ∀t, such as for the experimental data (to be discussed). Here, an Adam optimizer can be used for all models and frameworks except fastText, and no parameter tuning specific to each model or framework is performed, in order to focus on the effects of architectural variation.
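By way of a non-limiting illustration only, the per-example loss of Equation 10 can be sketched in Python as follows. The function names and the example probabilities are hypothetical; the sketch assumes the label is supplied as a class index, so that the cross-entropy against a one-hot target reduces to the negative log-probability of the true class.

```python
import math


def cross_entropy(y_index, probs, eps=1e-12):
    """Cross-entropy between a one-hot label (given by its index) and a probability vector."""
    return -math.log(max(probs[y_index], eps))


def mohe_loss(y_index, head_probs, gammas=None):
    """Equation 10: weighted sum of per-head cross-entropy losses, aggregator head included."""
    n_heads = len(head_probs)
    if gammas is None:
        gammas = [1.0 / n_heads] * n_heads     # equal weights for all heads
    if abs(sum(gammas) - 1.0) > 1e-9:
        raise ValueError("head weights must sum to 1")
    return sum(g * cross_entropy(y_index, p) for g, p in zip(gammas, head_probs))


# Hypothetical outputs of two estimator-thread heads and the aggregator head (last).
head_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
equal_loss = mohe_loss(0, head_probs)
aggregator_heavy = mohe_loss(0, head_probs, gammas=[0.25, 0.25, 0.5])
```

Because every head contributes its own loss term, gradients reach each estimator thread both directly through its own head and indirectly through the aggregator head.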

In another non-limiting exemplary embodiment of the disclosure described herein, an MoHE-2 model or framework system and method may be used, such as shown in FIG. 4B. Here, the MoHE-2 framework system and method can incorporate additional non-linearities that can act as a mini-aggregator module that allows the interaction of information geometries in two spaces, which can be a function of the input's embedding space (the mean in the minimal framework) and the input's encoding space. For the MoHE-2 framework, Equation 8 can be replaced by the following (Equation 11):

$g_t = \mathrm{Softmax}(\mathrm{CLF}_t(\mathrm{SLP}(\mathrm{Concatenate}(u_t, V_t))))$

where SLP is a single layer perceptron with tanh activations. In addition, Equation 9 can also be changed accordingly for the MoHE-2 framework. In particular, Dropouts can appear after the (LayerNorm←f_(t)(·)) and (LayerNorm←ENC_(t)) stacks. Here, a CNN can be used as the encoder within the disclosure described herein; however, it is contemplated within the scope of the present disclosure described herein that it can be replaced with other encoders such as RNNs, LSTMs, or transformers, among others. In addition, for the experimental data (to be discussed), seven estimator threads and one aggregator module are used for exemplary purposes. However, it is contemplated within the scope of the present disclosure described herein that any number of estimator threads and aggregator modules may be used.
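By way of a non-limiting illustration only, the MoHE-2 head of Equation 11 can be sketched in plain Python as follows. The sketch takes the function of the embedding to be the mean over token positions, as in the minimal framework; the function names, layer sizes, random weights, and the omission of bias terms, layer normalization, and dropout are assumptions made for the illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense(weights, x):
    """Linear layer without bias; weights has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mean_embedding(v_t):
    """Average the token embeddings over the sequence dimension."""
    return [sum(column) / len(v_t) for column in zip(*v_t)]


def mohe2_head(u_t, v_t, slp_weights, clf_weights):
    features = list(u_t) + mean_embedding(v_t)                        # Concatenate(u_t, mean of V_t)
    hidden = [math.tanh(h) for h in dense(slp_weights, features)]     # SLP with tanh activations
    return softmax(dense(clf_weights, hidden))                        # Softmax(CLF_t(...))


u_t = [0.7, -0.1, 0.4]                                      # hypothetical encoding with P_t = 3
v_t = [[0.1, 0.2], [0.3, -0.4], [0.0, 0.5], [0.2, 0.1]]     # hypothetical embeddings, L_t = 4, D_t = 2
slp_weights = rand_matrix(5, 3 + 2)                         # hidden size 5 is an arbitrary choice
clf_weights = rand_matrix(3, 5)                             # three classes
g_t = mohe2_head(u_t, v_t, slp_weights, clf_weights)
```

The single tanh layer is where the encoding and the embedding summary interact before classification, which is the role attributed above to the mini-aggregator.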

FIGS. 5A-5C illustrate non-limiting exemplary embodiments in which the threads to the right of estimator thread T can be “meta estimator threads,” which take as input any desired metadata, referred to herein as Meta Input. Here, at least one advantage of the MoHE framework system and method of the disclosure described herein can be its ability to accept domain knowledge as additional metadata. The MoHE framework system and method can add new estimator threads corresponding to individual or multiple metadata fields, thereby preserving the structure of the data. On the other hand, if rich metadata is appended to the main text, forming another longer text sequence, as is the case for fastText, then it can lead to loss of structure and strong coupling of metadata parameters. Accordingly, the MoHE framework of the disclosure described herein can receive auxiliary information, or the products' metadata, in two different ways or two different methods. The first method, which is referred to herein as method-1 as applied to MoHE-1, is shown in FIG. 5A. For the MoHE-1 model for method-1, the metadata inputs are embedded and encoded, and the encodings are concatenated with the inputs to all classifiers (CLF layers) including the aggregator module (AGG). Here, multiple types of metadata can be given to a single metadata estimator thread or to separate metadata estimator threads depending on the data and/or encoder types. The second method, which is referred to herein as method-2 as applied to MoHE-2, is shown in FIG. 5C. For the MoHE-2 model for method-2, the metadata threads can be identical to those of MoHE-1 with the exception of their outputs being given to the SLPs in the MoHE-2 model, instead of directly to the classifiers. Here, the aggregator module does not take any input from the metadata threads in this case. Further, the system and method of the disclosure described herein can employ basic text (one-dimensional) CNNs with a kernel size of one for the metadata encoders.
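By way of a non-limiting illustration only, the behavior of a metadata encoder with a kernel size of one can be sketched in Python as follows. The metadata token names, the embedding size, the number of filters, and the random weights are hypothetical. With a kernel size of one, each filter scores every token independently, and global max pooling keeps the strongest score per filter, so the encoding does not depend on token order.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

EMB_DIM = 4
N_FILTERS = 3
vocab = ["shop_17", "tag_red", "tag_cotton", "tag_large"]   # hypothetical metadata tokens
embedding = {tok: [rng.uniform(-1, 1) for _ in range(EMB_DIM)] for tok in vocab}
filters = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(N_FILTERS)]


def encode_metadata(tokens):
    """Kernel-size-one convolution followed by global max pooling."""
    scores = [
        [sum(w * e for w, e in zip(f, embedding[tok])) for f in filters]
        for tok in tokens
    ]
    return [max(column) for column in zip(*scores)]


in_order = encode_metadata(["shop_17", "tag_red", "tag_cotton"])
shuffled = encode_metadata(["tag_cotton", "shop_17", "tag_red"])
```

Here `in_order` and `shuffled` are identical, which is consistent with treating metadata fields such as identifiers and tags as an unordered set of keywords.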

For the experimental data with respect to any of the foregoing models for the MoHE framework, such as MoHE-1 and MoHE-2, the metadata features that were used appear only in one of the datasets, namely, a large scale Japanese product catalog (E-commerce 1), for exemplary purposes. Here, for the experimental data, there are multiple metadata values available for each item, such as various identification numbers related to the products, description, price, tags, image URLs, etc. For example, many merchants/shops sell products in only certain categories, and therefore “shop_ID” can be a strong feature for label correlation. A similar signal can be “tag_ID,” which can refer to an attribute type of a product. Within the experimental testing of the disclosure described herein, the maker/brand and shop tags are used as features and descriptions as another metadata feature.

As previously disclosed herein, the meta estimator threads employ CNNs with kernel sizes of one as their encoders, so as to make them serve as keyword finders. For “descriptions,” however, only the nouns, adjectives, and adverbs are kept, and repeated words are omitted. The descriptions can therefore be a sequence of part-of-speech tagged tokens, and the window size is set to one as well. This “feature engineering” of descriptions fits long sentences within a maximum length ≤120. In addition, a tokenizer is used for tokenizing and extracting parts of speech from Japanese product titles and descriptions. Accordingly, Table 6 of FIG. 11C shows that using metadata for the MoHE-1 and MoHE-2 models, performance on the validation set improves by 3% absolute in Macro-F1 scores.
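By way of a non-limiting illustration only, the filtering of descriptions can be sketched in Python as follows. The tag names (NOUN, ADJ, ADV), the English example tokens, and the truncation at 120 tokens are assumptions made for the illustration; the tokenizer and tag set actually used for the Japanese descriptions are not specified herein, so the sketch starts from tokens that have already been tagged.

```python
KEPT_POS = {"NOUN", "ADJ", "ADV"}   # hypothetical tag names for nouns, adjectives, adverbs
MAX_LEN = 120


def filter_description(tagged_tokens, max_len=MAX_LEN):
    """Keep selected parts of speech, drop repeated words, and cap the sequence length."""
    seen = set()
    kept = []
    for token, pos in tagged_tokens:
        if pos not in KEPT_POS or token in seen:
            continue
        seen.add(token)
        kept.append((token, pos))
        if len(kept) == max_len:
            break
    return kept


# Hypothetical output of a part-of-speech tagger for a short description.
tagged = [
    ("soft", "ADJ"), ("cotton", "NOUN"), ("shirt", "NOUN"), ("is", "VERB"),
    ("very", "ADV"), ("soft", "ADJ"), ("and", "CONJ"), ("cotton", "NOUN"),
]
result = filter_description(tagged)
```

Since the downstream encoder uses a window size of one, dropping function words and repeats does not remove any ordering information that the encoder would otherwise have used.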

Table 1 of FIG. 6A illustrates baseline thread parameters for the experimental test. Here, the thread indices are ordered from left to right, such as shown in FIGS. 4A-5C. Further, the input sequence lengths are set to 60 for word based tokenization and 100 for character based tokenization since greater than 90% of titles are shorter than 60 words and 100 characters in length. Still referring to FIG. 6A, for the E-commerce 2 dataset, the default settings of CNN kernel sizes for character tokenization are smaller than for E-commerce 1 since the average length of English words is ≈5 characters and sequential multiples of 5 were used. Further, “bi-grams” refers to bi-grams of tokens.

In addition, for all of the experimental tests, the E-commerce 1 dataset was partitioned into training, development, and validation sets, all of which are sampled from the same data distribution. This distribution of items has no sampling bias in terms of purchase behavior and includes a large sample of items from purchased and non-purchased items and a minor percentage of historical curated items whose genres have been manually corrected. The data has noisy labels to the extent of 20% based on internal assessment. Further, a sample of genres was used based on purchased items from user sessions to validate this 20% figure. In addition, a non-overlapping evaluation set for the E-commerce 1 dataset was used, where annotators have sampled items based on Gross Merchandise Sale (GMS) values and corrected mis-predicted genres from a previous model. However, for the experiments with the E-commerce 1 dataset, the validation set was used for model comparison. For the E-commerce 2 dataset, the challenge set was set to 200K items.

With respect to configurations for the MoHE threads, each estimator thread t is an embedding, encoder, and classifier stack with output layer g_(t), which can be represented by the following:

$g_t = \mathrm{CLF}_{t,3}(\mathrm{LayerNorm}(\mathrm{ENC}_{t,2}(\mathrm{EMB}_{t,1}(x))))$

Here, each thread has different parameters and input tokenization types as summarized in Table 1 of FIG. 6A. Further, the parameter values are obtained using minimal manual tuning over a development set for the Ensemble model. Here, the word embedding dimension is set to

$\min\left(\frac{C}{2}, 100\right)$

where C is the number of leaf nodes for each level one genre. This setting substantially reduces the number of parameters. Further, this embedding dimension is set for every model framework except fastText NNI and GCP AutoML. Finally, the dropout values are set to 0.1. Further, experiments were conducted in which up to seven estimator threads were incrementally added to all models, with the results shown in FIGS. 8-10. Here, the baseline configurations are used for building models for both the E-commerce 1 and E-commerce 2 datasets. For these experiments, the parameters/properties of the estimator threads were not tuned. However, it is contemplated within the scope of the present disclosure described herein that tuning may also be performed.
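By way of a non-limiting illustration only, the embedding dimension rule min(C/2, 100) can be sketched in Python as follows. Rounding C/2 down to an integer and enforcing a minimum dimension of one are assumptions made for the illustration, as are the function names and the example vocabulary size.

```python
def embedding_dim(num_leaf_nodes, cap=100):
    """min(C / 2, 100), floored to an integer and kept at least 1 (both are assumptions)."""
    return max(1, min(num_leaf_nodes // 2, cap))


def embedding_parameters(vocab_size, num_leaf_nodes):
    """Number of weights in one embedding table of shape (vocab_size, embedding_dim)."""
    return vocab_size * embedding_dim(num_leaf_nodes)


dims = {c: embedding_dim(c) for c in (30, 200, 1000)}
small_genre_params = embedding_parameters(50000, 30)     # hypothetical vocabulary of 50,000 tokens
large_genre_params = embedding_parameters(50000, 1000)
```

For a level one genre with few leaf nodes, the embedding table is correspondingly smaller, which is one way the setting reduces the number of parameters.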

For the experimental tests and evaluation of the MoHE framework of the disclosure described herein, Macro-F1 scores are used, which induce equal weighting of genre performance and hence are a much stricter standard than other types of scores. In addition, for all models except AutoML and fastText NNI, the scores reported are averages of five runs. It is noted that the CNN-based Ensemble of the disclosure described herein (i.e., MoHE without the coupling) is a strong classifier and significantly outperforms the MoE (FIG. 3B) and Aggregator baselines. Further, MoHE-2 was observed to significantly outperform Ensemble on the validation set.
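By way of a non-limiting illustration only, the difference between Macro-F1 and Micro-F1 can be sketched in Python as follows. The function name and the toy labels are hypothetical. The example uses an imbalanced label set in which a classifier that always predicts the frequent class obtains a high Micro-F1 but a low Macro-F1.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)   # equal weight per class
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))     # equal weight per item
    return macro, micro


y_true = ["a"] * 8 + ["b"] * 2     # class "b" is rare
y_pred = ["a"] * 10                # classifier that ignores the rare class
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
```

In this example the Micro-F1 is 0.8 while the Macro-F1 is 4/9, because the rare class contributes an F1 of zero with the same weight as the frequent class.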

For the experimental testing, a BERT model was compared with the MoHE-2 model of the disclosure described herein in a preliminary comparison on a randomly selected 10% of level one genres from the E-commerce 1 dataset and all genres from the E-commerce 2 dataset. Here, the BERT model can be the model disclosed within Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2019, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL-HLT, Association for Computational Linguistics, 4171-4186. Table 2 of FIG. 6B illustrates the BERT v. MoHE-2 comparison on four (4) randomly selected L1 genres from the E-commerce 1 dataset and the full E-commerce 2 dataset. Here, the bold values denote cases where MoHE-2 significantly outperformed BERT at a 95% confidence interval under bootstrap sampling, which was the case for most genres along all aspects of Macro-F1, compute time, and model size. Here, one of the main drawbacks of the BERT model is that it is a more generalized multi-task model where fine-tuning is dependent on specific objectives of next word prediction based on a suitably chosen context. For the case of classification of item titles, the Next Sentence Prediction (NSP) objective of BERT is irrelevant even if the model were to be pre-trained on item titles.
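By way of a non-limiting illustration only, one way to carry out a bootstrap comparison of two classifiers at a 95% confidence level can be sketched in Python as follows. The exact bootstrap procedure used for the experimental data is not detailed herein, so the paired percentile bootstrap below, the function names, the use of accuracy as the metric, and the synthetic predictions are all assumptions made for the illustration.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_diff_ci(y_true, pred_a, pred_b, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for metric(A) - metric(B) over resampled items."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]     # resample items with replacement
        truth = [y_true[i] for i in idx]
        score_a = metric(truth, [pred_a[i] for i in idx])
        score_b = metric(truth, [pred_b[i] for i in idx])
        diffs.append(score_a - score_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic example: model A is wrong on 10% of items, model B on 30%.
y_true = [i % 4 for i in range(200)]
pred_a = [(y + 1) % 4 if i % 10 == 0 else y for i, y in enumerate(y_true)]
pred_b = [(y + 1) % 4 if i % 10 in (0, 1, 2) else y for i, y in enumerate(y_true)]
lower, upper = bootstrap_diff_ci(y_true, pred_a, pred_b, accuracy)
significant = lower > 0   # the interval excludes zero
```

Any metric with the same call signature, such as a Macro-F1 function, could be passed in place of `accuracy`.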

The experimental testing using the MoHE framework system and method of the disclosure described herein can start with analyzing the importance of adding successive estimator threads to the different models and comparing the graphs in the three plots shown in FIGS. 8, 9, and 10. In particular, FIGS. 8, 9, and 10 illustrate plots of Macro-F1 values for the MoE, Aggregator, Ensemble, MoHE-1, and MoHE-2 models from level one genre path classifiers for the E-commerce 1 dataset. Here, the leaf nodes for classification correspond to the level one genres, which are organized into head, torso, and tail segments. Overall, there are 38 level one genres and hence 38 groups of level one classifiers. Here, each of the 38 groups represents a set of estimator threads corresponding to a particular classification framework. In particular, FIG. 8 illustrates Macro-F1 values plotted against the number of threads for the head genres. FIG. 9 illustrates Macro-F1 values plotted against the number of threads for the torso genres. FIG. 10 illustrates Macro-F1 values plotted against the number of threads for the tail genres.

As previously discussed, the MoE model is still a single classifier with a loose generalization bound, and it performs worst among all the models compared. Further, for the MoE model, the testing uses only one type of input that is shared with the “experts” and the “gate,” and the configuration shown for estimator thread one in Table 1 of FIG. 6A is chosen. Because of this constraint, the MoE model also does not show much variation in performance, since the estimator threads differ only in the initialization of the input embedding. Here, MoE suffers from bias in input selection, which may also explain its poor performance. The classification performance shown in FIGS. 8-10 with regard to Macro-F1 scores for the level one genre paths of the head segment is overwhelmingly dominated by the MoHE-2 model of the disclosure described herein. For the level one genre paths that belong to the torso and tail segments, MoHE-2 also outperforms Ensemble at seven estimator threads. Further, the additional mini-aggregators introduced in MoHE-2 show improvements. In addition, classification using only the Aggregator module of the MoHE framework of the disclosure described herein was shown to be an improvement over the MoE model where all the “expert” decisions are fused.

Table 3 of FIG. 7 illustrates the comparisons of MoHE with MoE, the Aggregator framework (FIG. 2), GCP AutoML, fastText, fastText Autotuned with NNI, and the Ensemble framework. In particular, Table 3 of FIG. 7 illustrates baseline model performance comparisons (Micro-F1/Macro-F1) for the representative nine genres from the validation set. Here, GCP AutoML ignores rare categories while training. The support set for categories during its evaluation is thus smaller, leading to higher Micro-F1 scores being reported by GCP AutoML. Further, the numbers in bold for the MoHE-2 column are statistically significant with respect to both fastText Autotune NNI and Ensemble under a Bootstrap Sampling test with a 95% confidence interval. Further, it is noted that MoE, Aggregator, fastText, Ensemble, MoHE-1, and MoHE-2 are not tuned to individual genres. Still referring to Table 3 of FIG. 7, the level one genres are first sorted in descending order of item frequency and segmented into head, torso, and tail segments. Next, nine categories are chosen, namely the largest three from each of the head, torso, and tail segments. Next, GCP AutoML and fastText tuned with Microsoft's NNI are compared on these nine categories, with GCP AutoML run for at most a day for each of the nine genres. Further, fastText Autotune was not found to be stable for the E-commerce 1 dataset. It is noted that GCP AutoML constrains the volume of data ingestion, including skipping rare categories, thereby hindering a fair comparison. It also reports Micro-F1 scores in batch mode, and obtaining Macro-F1 scores incurs additional cost; thus they are not reported in Table 3 of FIG. 7. Hence, GCP AutoML is dropped from further comparisons.
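By way of a non-limiting illustration only, the segmentation of level one genres and the choice of the largest three genres per segment can be sketched in Python as follows. The genre names and item counts are hypothetical, and the rule for a genre that straddles a segment boundary (it is assigned according to the cumulative share of items before it) is an assumption made for the illustration; the 70%/20%/10% shares follow the head, torso, and tail definition given earlier herein.

```python
def segment_genres(item_counts, head_share=0.70, torso_share=0.20):
    """Assign each genre to head, torso, or tail by cumulative share of items."""
    total = sum(item_counts.values())
    ordered = sorted(item_counts.items(), key=lambda kv: kv[1], reverse=True)
    segments = {"head": [], "torso": [], "tail": []}
    cumulative = 0
    for genre, count in ordered:
        share_before = cumulative / total        # share of items in all larger genres
        if share_before < head_share:
            segments["head"].append(genre)
        elif share_before < head_share + torso_share:
            segments["torso"].append(genre)
        else:
            segments["tail"].append(genre)
        cumulative += count
    return segments


def largest_per_segment(segments, k=3):
    """Genres within each segment are already in descending order of item count."""
    return {name: genres[:k] for name, genres in segments.items()}


# Hypothetical item counts for ten level one genres (1,000 items in total).
counts = {"g1": 300, "g2": 250, "g3": 200, "g4": 80, "g5": 60,
          "g6": 40, "g7": 30, "g8": 20, "g9": 15, "g10": 5}
chosen = largest_per_segment(segment_genres(counts))
```

With these counts the sketch selects g1 to g3 for the head, g4 to g6 for the torso, and g7 to g9 for the tail.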

During the experimental testing, it was observed that the MoHE frameworks of the disclosure described herein with the default settings (Table 1 of FIG. 6A) often perform better than other baselines despite the fact that they consist of lightweight CNN architectures that are not tuned for a specific genre or dataset. The gains are obtained more for the head and torso genres, and since the category imbalance was not specifically modeled, the performance on the tail categories is not significantly better than both fastText Autotune NNI and Ensemble, but only better than the underlined one. The performance of the MoHE-2 model is even better for the E-commerce 2 dataset, which has much less label noise and a lower number of classes.

The quantitative evaluations for the models, such as the MoE, Aggregator, fastText Autotune NNI, Ensemble, MoHE-1, and MoHE-2 models, can be summarized as follows. The MoHE frameworks or models can be compared without adding metadata for the E-commerce 1 dataset to be fair to the E-commerce 2 dataset, which does not carry any metadata. Table 4 of FIG. 11A shows the comparative performance of the MoHE models against the baselines or other models. In particular, Table 4 of FIG. 11A illustrates Macro-F1 scores from the classifiers discussed herein on the validation set from the E-commerce 1 dataset, where the MoHE-2 model framework performed the best. Table 5 of FIG. 11B illustrates Macro-F1 scores from the models and frameworks discussed herein for the test set from the E-commerce 2 dataset. For obtaining results from the E-commerce 2 dataset, as shown in Table 5 of FIG. 11B, the classifiers were set up as flat classifiers. In this case as well, the MoHE-2 model of the disclosure described herein outperformed all other models and frameworks. It is noted that in Table 3 of FIG. 7, GCP AutoML shows the highest Micro-F1 for this dataset due to a smaller support set.

Next, the MoHE framework system and method of the disclosure described herein can be compared with and without the use of metadata features. In particular, Table 6 of FIG. 11C illustrates Macro-F1 values for the MoHE-1 and MoHE-2 classifiers without and with metadata for level one genres in the validation set. Further, notations for the added metadata values are meta-1 (shop_ID), meta-2 (shop_ID+tag_ID), and meta-3 (shop_ID+tag_ID+description). As previously noted, method-1 and method-2 are two different ways of adding metadata to the MoHE frameworks. Based on ablation studies shown in Table 6 of FIG. 11C, both “shop_ID” and “tag_ID” turn out to have strong correlations with labels. The effectiveness of descriptions largely depends on the genre, yet including tokens from descriptions with chosen parts of speech improves overall performance. By utilizing all three types of metadata, the largest level one genres gain 2-3% Macro-F1 performance depending on the framework, and it has been observed that some of the tail genres gain more than 10%. Further, for exemplary purposes, the values for “shop_ID” and “tag_ID” were given to the same metadata thread while description was given to a separate metadata estimator thread. Table 6 of FIG. 11C illustrates that for all head, torso, and tail segments for L1 genres, MoHE-2, meta-3 (method-2) performs best, although the difference from MoHE-1, meta-3 (method-1) is not statistically significant for the tail segment.

It is understood that the specific order or hierarchy of blocks in theprocesses/flowcharts disclosed herein is an illustration of exampleapproaches. Based upon design preferences, it is understood that thespecific order or hierarchy of blocks in the processes/flowcharts may berearranged. Further, some blocks may be combined or omitted. Theaccompanying method claims present elements of the various blocks in asample order, and are not meant to be limited to the specific order orhierarchy presented.

Some embodiments may relate to a system, a method, and/or a computerreadable medium at any possible technical detail level of integration.Further, one or more of the above components described above may beimplemented as instructions stored on a computer readable medium andexecutable by at least one processor (and/or may include at least oneprocessor). The computer readable medium may include a computer-readablenon-transitory storage medium (or media) having computer readableprogram instructions thereon for causing a processor to carry outoperations.

The computer readable storage medium can be a tangible device that canretain and store instructions for use by an instruction executiondevice. The computer readable storage medium may be, for example, but isnot limited to, an electronic storage device, a magnetic storage device,an optical storage device, an electromagnetic storage device, asemiconductor storage device, or any suitable combination of theforegoing. A non-exhaustive list of more specific examples of thecomputer readable storage medium includes the following: a portablecomputer diskette, a hard disk, a random access memory (RAM), aread-only memory (ROM), an erasable programmable read-only memory (EPROMor Flash memory), a static random access memory (SRAM), a portablecompact disc read-only memory (CD-ROM), a digital versatile disk (DVD),a memory stick, a floppy disk, a mechanically encoded device such aspunch-cards or raised structures in a groove having instructionsrecorded thereon, and any suitable combination of the foregoing. Acomputer readable storage medium, as used herein, is not to be construedas being transitory signals per se, such as radio waves or other freelypropagating electromagnetic waves, electromagnetic waves propagatingthrough a waveguide or other transmission media (e.g., light pulsespassing through a fiber-optic cable), or electrical signals transmittedthrough a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program code/instructions for carrying out operations may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects or operations.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer readable media according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). The method, computer system, and computer readable medium may include additional blocks, fewer blocks, different blocks, or differently arranged blocks than those depicted in the Figures. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed concurrently or substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

It will be apparent that systems and/or methods, described herein, may be implemented in different forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods is not limiting of the implementations. Thus, the operation and behavior of the systems and/or methods were described herein without reference to specific software code, it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein.

What is claimed is:
 1. An item classification method using multi-output headed ensembles, the method performed by at least one processor and comprising: receiving one or more text input sequences at one or more first estimator threads corresponding to the one or more text input sequences; tokenizing the one or more text input sequences into one or more first tokens within the one or more first estimator threads; and outputting one or more item classifications based on an output of the one or more first estimator threads.
 2. The method of claim 1, further comprising: applying a backpropagation algorithm to update one or more network weights connecting one or more neural layers in the one or more first estimator threads; defining an optimal setting of network parameters using cross-validation with respect to the one or more first estimator threads; and mapping the one or more first tokens to an embedding space within the one or more first estimator threads.
 3. The method of claim 1, further comprising: defining one or more hyperparameters using an efficient hyperparameter search technique with respect to the one or more first estimator threads.
 4. The method of claim 1, further comprising: tokenizing the one or more text input sequences into one or more second tokens within one or more second estimator threads corresponding to the second tokens.
 5. The method of claim 4, further comprising: determining one or more coordinates for the one or more second tokens within an embedding space of the one or more second estimator threads.
 6. The method of claim 5, further comprising: encoding the determined one or more coordinates for the one or more second tokens using one or more convolutional neural network (CNN) weights with a dropout layer, thereby resulting in one or more vectors with respect to the one or more second estimator threads.
 7. The method of claim 6, further comprising: applying a layer normalizer to the one or more vectors to normalize the one or more vectors within the one or more second estimator threads; and sending the normalized one or more vectors from the one or more second estimator threads to an aggregator.
 8. The method of claim 7, further comprising: calculating one or more posterior class probabilities for one or more output heads corresponding to the one or more second estimator threads.
 9. The method of claim 8, further comprising: obtaining the one or more item classifications based on the one or more posterior class probabilities at the output heads for the one or more second estimator threads.
 10. The method of claim 9, wherein the one or more posterior class probabilities at the output heads further comprise an output of the aggregator.
 11. An apparatus for classifying items using multi-output headed ensembles, the apparatus comprising: a memory storage storing computer program code; and at least one processor communicatively coupled to the memory storage, wherein the processor is configured to execute the computer program code and includes: receive one or more text input sequences at one or more first estimator threads corresponding to the one or more text input sequences; tokenize the one or more text input sequences into one or more first tokens within the one or more first estimator threads; and output one or more item classifications based on an output of the one or more first estimator threads.
 12. The apparatus of claim 11, wherein the computer program code, when executed by the processor, further causes the apparatus to: apply a backpropagation algorithm to update one or more network weights connecting one or more neural layers in the one or more first estimator threads; define an optimal setting of network parameters using cross-validation with respect to the one or more first estimator threads; and map the one or more first tokens to an embedding space within the one or more first estimator threads.
 13. The apparatus of claim 11, wherein the computer program code, when executed by the processor, further causes the apparatus to: define one or more hyperparameters using an efficient hyperparameter search technique with respect to the one or more first estimator threads.
 14. The apparatus of claim 11, wherein the computer program code, when executed by the processor, further causes the apparatus to: tokenize the one or more text input sequences into one or more second tokens within one or more second estimator threads corresponding to the second tokens.
 15. The apparatus of claim 14, wherein the computer program code, when executed by the processor, further causes the apparatus to: determine one or more coordinates for the one or more second tokens within an embedding space of the one or more second estimator threads.
 16. The apparatus of claim 15, wherein the computer program code, when executed by the processor, further causes the apparatus to: encode the determined one or more coordinates for the one or more second tokens using one or more convolutional neural network (CNN) weights with a dropout layer, thereby resulting in one or more vectors with respect to the one or more second estimator threads.
 17. The apparatus of claim 16, wherein the computer program code, when executed by the processor, further causes the apparatus to: apply a layer normalizer to the one or more vectors to normalize the one or more vectors within the one or more second estimator threads; and send the normalized one or more vectors from the one or more second estimator threads to an aggregator.
 18. The apparatus of claim 17, wherein the computer program code, when executed by the processor, further causes the apparatus to: calculate one or more posterior class probabilities for one or more output heads corresponding to the one or more second estimator threads.
 19. The apparatus of claim 18, wherein the computer program code, when executed by the processor, further causes the apparatus to: obtain the one or more item classifications based on the one or more posterior class probabilities at the output heads for the one or more second estimator threads.
 20. A non-transitory computer-readable medium comprising computer program code for classifying items using multi-output headed ensembles by an apparatus, wherein the computer program code, when executed by at least one processor of the apparatus, causes the apparatus to: receive one or more text input sequences at one or more first estimator threads corresponding to the one or more text input sequences; tokenize the one or more text input sequences into one or more first tokens within the one or more first estimator threads; and output one or more item classifications based on an output of the one or more first estimator threads.
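
By way of non-limiting illustration only, the sequence of operations recited in claims 4 through 10 (tokenizing, determining embedding coordinates, CNN encoding with dropout, layer normalization, aggregation, and posterior class probabilities at the output heads) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the disclosed implementation: the names `make_thread`, `cnn_encode`, `head`, and `classify` are hypothetical, the weights are random and untrained, the aggregator is assumed to be an element-wise mean of the normalized vectors, and the final classification is assumed to be the argmax of the averaged head posteriors.

```python
import math
import random


def tokenize(text):
    """Whitespace tokenizer; a stand-in for whatever tokenizer a thread uses."""
    return text.lower().split()


def embed(tokens, table, dim, rng):
    """Look up (or lazily create) coordinates for each token in an embedding space."""
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return [table[tok] for tok in tokens]


def cnn_encode(coords, kernels, width, dim, drop_p=0.0, rng=None):
    """1-D convolution over token positions, ReLU, max-over-time pooling, dropout."""
    # Pad short (or empty) sequences with zero vectors so one window always exists.
    padded = coords + [[0.0] * dim] * max(0, width - len(coords))
    features = []
    for kernel in kernels:
        best = 0.0  # starting the max at 0 acts as the ReLU floor
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            best = max(best, sum(w * x for w, x in zip(kernel, window)))
        features.append(best)
    if drop_p > 0.0 and rng is not None:  # inverted dropout, used only in training
        features = [0.0 if rng.random() < drop_p else f / (1.0 - drop_p)
                    for f in features]
    return features


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head(vec, weights):
    """Linear output head followed by softmax, giving posterior class probabilities."""
    return softmax([sum(w * v for w, v in zip(row, vec)) for row in weights])


def make_thread(seed, dim=4, width=2, n_kernels=6, n_classes=3):
    """Hypothetical estimator thread with random (untrained) weights."""
    rng = random.Random(seed)
    return {
        "rng": rng, "table": {}, "dim": dim, "width": width,
        "kernels": [[rng.uniform(-1, 1) for _ in range(width * dim)]
                    for _ in range(n_kernels)],
        "head": [[rng.uniform(-1, 1) for _ in range(n_kernels)]
                 for _ in range(n_classes)],
    }


def classify(text, threads, agg_head):
    """Run every thread, aggregate, and average the posteriors of all heads."""
    vectors, posteriors = [], []
    for th in threads:
        coords = embed(tokenize(text), th["table"], th["dim"], th["rng"])
        vec = layer_norm(cnn_encode(coords, th["kernels"], th["width"], th["dim"]))
        vectors.append(vec)
        posteriors.append(head(vec, th["head"]))
    # Assumed aggregator: element-wise mean of the normalized thread vectors.
    aggregated = [sum(col) / len(col) for col in zip(*vectors)]
    posteriors.append(head(aggregated, agg_head))
    # Assumed combination rule: average the posteriors of all output heads.
    avg = [sum(col) / len(col) for col in zip(*posteriors)]
    return avg.index(max(avg)), avg
```

In this sketch every thread is assumed to produce a vector of the same length so that the aggregator can average them; for example, two threads built with `make_thread(0)` and `make_thread(1)` and an aggregator head of three rows of six weights would return a class index in the range 0 to 2 together with a probability vector that sums to one. Training by backpropagation, cross-validation, and hyperparameter search, as recited in claims 2 and 3, are outside the scope of the sketch.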